16 research outputs found
Holistic indoor scene understanding, modelling and reconstruction from single images.
3D indoor scene understanding in computer vision refers to perceiving the semantic and geometric information of a 3D indoor environment from partial observations (e.g. images or depth scans). Semantics in a scene generally involves conceptual knowledge such as the room layout, object categories, and their interrelationships (e.g. support relationships). These scene semantics are usually coupled with object and room geometry in 3D scene understanding, for example the layout plan (i.e. the location of walls, ceiling and floor), the shapes of in-room objects, and the camera pose of the observer. This thesis focuses on holistic 3D scene understanding from single images, modelling or reconstructing the indoor geometry with enriched scene semantics. This challenging task requires computers to perform on par with the human visual system in perceiving and understanding indoor contents from colour intensities. Existing works either focus on a sub-problem (e.g. layout estimation, 3D detection or object reconstruction) or address the entire problem with independent subtasks, while this thesis aims at an integrated and unified solution to semantic scene understanding and reconstruction. In this thesis, scene semantics and geometry are regarded as intertwined and complementary: understanding either part helps to perceive the other, which enables joint scene understanding, modelling and reconstruction.

We start with the problem of semantic scene modelling. To estimate object semantics and shapes from a single image, we propose a feasible scene modelling pipeline. It is backboned by fully convolutional networks that learn 2D semantics and geometry, and powered by top-down shape retrieval for object modelling. After this, we build a unified and more efficient visual system for semantic scene modelling. Scene semantics are divided into relational (i.e. support relationships) and non-relational (i.e. object segmentation and geometry, room layout) knowledge, and a Relation Network is proposed to estimate the support relations between objects to guide the object modelling process. Afterwards, we address holistic, end-to-end scene understanding and reconstruction. Instead of modelling scenes by top-down shape retrieval, this method bridges the gap between scene understanding and object mesh reconstruction and does not rely on any external CAD repositories: camera pose, room layout, object bounding boxes and meshes are predicted end-to-end from an RGB image with a single network architecture. Finally, we extend our work to a different input modality, single-view depth scans, to explore object reconstruction performance. A skeleton-bridged approach is proposed that predicts the meso-skeleton of a shape as an intermediate representation to guide surface reconstruction, outperforming the prior art in shape completion.

Overall, this thesis provides a series of novel approaches towards holistic 3D indoor scene understanding, modelling and reconstruction. It aims at automatic 3D scene perception that enables machines to understand and predict 3D contents as human vision does, which we hope will advance the boundaries of 3D vision in machine perception, robotics and artificial intelligence.
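The skeleton-bridged completion mentioned above lends itself to a compact two-stage sketch: one network predicts a meso-skeleton volume from the partial scan, and a second network conditions surface prediction on that skeleton. The module names and sizes below are illustrative assumptions, not the thesis implementation.

```python
import torch
import torch.nn as nn

class SkeletonNet(nn.Module):
    """Stage 1: predict a coarse meso-skeleton volume from a partial scan."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(1, 16, 3, padding=1), nn.ReLU(),
            nn.Conv3d(16, 16, 3, padding=1), nn.ReLU(),
            nn.Conv3d(16, 1, 3, padding=1),  # skeleton occupancy logits
        )
    def forward(self, partial_vol):          # (B, 1, D, H, W)
        return self.net(partial_vol)

class SurfaceNet(nn.Module):
    """Stage 2: predict the full surface, conditioned on the skeleton."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(2, 32, 3, padding=1), nn.ReLU(),
            nn.Conv3d(32, 32, 3, padding=1), nn.ReLU(),
            nn.Conv3d(32, 1, 3, padding=1),  # surface occupancy logits
        )
    def forward(self, partial_vol, skeleton_logits):
        x = torch.cat([partial_vol, torch.sigmoid(skeleton_logits)], dim=1)
        return self.net(x)

partial = torch.rand(1, 1, 32, 32, 32)       # toy partial depth-scan volume
skel = SkeletonNet()(partial)                # intermediate representation
surface = SurfaceNet()(partial, skel)        # skeleton-guided completion
```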
Learning 3D Scene Priors with 2D Supervision
Holistic 3D scene understanding entails estimation of both layout
configuration and object geometry in a 3D environment. Recent works have shown
advances in 3D scene estimation from various input modalities (e.g., images, 3D
scans), by leveraging 3D supervision (e.g., 3D bounding boxes or CAD models),
for which collection at scale is expensive and often intractable. To address
this shortcoming, we propose a new method to learn 3D scene priors of layout
and shape without requiring any 3D ground truth. Instead, we rely on 2D
supervision from multi-view RGB images. Our method represents a 3D scene as a
latent vector, from which we can progressively decode to a sequence of objects
characterized by their class categories, 3D bounding boxes, and meshes. With
our trained autoregressive decoder representing the scene prior, our method
facilitates many downstream applications, including scene synthesis,
interpolation, and single-view reconstruction. Experiments on 3D-FRONT and
ScanNet show that our method outperforms state of the art in single-view
reconstruction, and achieves state-of-the-art results in scene synthesis
against baselines that require 3D supervision.
Video: https://youtu.be/YT7MEdygRoY Project: https://yinyunie.github.io/sceneprior-page
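As a rough illustration of the autoregressive scene prior, the sketch below decodes a latent scene code into a sequence of objects, each with class logits and 3D box parameters; a mesh head would attach analogously. The GRU cell, the "stop" class, and all dimensions are assumptions for exposition, not the paper's exact decoder.

```python
import torch
import torch.nn as nn

class SceneDecoder(nn.Module):
    """Decode a latent scene code into a sequence of (class, 3D box) predictions."""
    def __init__(self, latent_dim=256, num_classes=20, max_objects=10):
        super().__init__()
        self.max_objects = max_objects
        self.rnn = nn.GRUCell(latent_dim, latent_dim)
        self.cls_head = nn.Linear(latent_dim, num_classes + 1)  # +1 = "stop" token
        self.box_head = nn.Linear(latent_dim, 7)  # center (3) + size (3) + yaw (1)

    def forward(self, z):                     # z: (B, latent_dim), the scene code
        h, objects = torch.zeros_like(z), []
        for _ in range(self.max_objects):
            h = self.rnn(z, h)                # condition every step on the scene code
            objects.append((self.cls_head(h), self.box_head(h)))
        return objects

z = torch.randn(2, 256)                       # e.g. sampled from the learned prior
for cls_logits, box in SceneDecoder()(z):
    pass  # at inference, stop decoding once the "stop" class wins
```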
Pose2Room: Understanding 3D Scenes from Human Activities
With wearable IMU sensors, one can estimate human poses without requiring
visual input (von Marcard et al., 2017). In this work, we pose the
question: Can we reason about object structure in real-world environments
solely from human trajectory information? Crucially, we observe that human
motion and interactions tend to give strong information about the objects in a
scene; for instance, a person sitting indicates the likely presence of a chair
or sofa. To this end, we propose P2R-Net to learn a probabilistic 3D model of
the objects in a scene characterized by their class categories and oriented 3D
bounding boxes, based on an input observed human trajectory in the environment.
P2R-Net models the probability distribution of object class as well as a deep
Gaussian mixture model for object boxes, enabling sampling of multiple,
diverse, likely modes of object configurations from an observed human
trajectory. In our experiments we show that P2R-Net can effectively learn
multi-modal distributions of likely objects for human motions, and produce a
variety of plausible object structures of the environment, even without any
visual information. The results demonstrate that P2R-Net consistently
outperforms the baselines on the PROX dataset and the VirtualHome platform.
Accepted by ECCV 2022; Project page: https://yinyunie.github.io/pose2room-page/
Video: https://www.youtube.com/watch?v=MFfKTcvbM5
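The deep Gaussian mixture over boxes can be pictured with a small mixture-density head: from a pooled trajectory feature it predicts mixture weights, means, and scales over box parameters, and sampling a mode yields one plausible object configuration. Everything here (dimensions, the diagonal-Gaussian choice) is an illustrative assumption rather than P2R-Net's actual architecture.

```python
import torch
import torch.nn as nn

class MixtureBoxHead(nn.Module):
    """Predict a K-mode Gaussian mixture over oriented-box parameters."""
    def __init__(self, feat_dim=128, box_dim=7, k=4):
        super().__init__()
        self.k, self.box_dim = k, box_dim
        self.pi = nn.Linear(feat_dim, k)                 # mixture weights
        self.mu = nn.Linear(feat_dim, k * box_dim)       # per-mode box means
        self.log_sigma = nn.Linear(feat_dim, k * box_dim)

    def forward(self, feat):                             # feat: (B, feat_dim)
        B = feat.shape[0]
        pi = torch.softmax(self.pi(feat), dim=-1)
        mu = self.mu(feat).view(B, self.k, self.box_dim)
        sigma = self.log_sigma(feat).view(B, self.k, self.box_dim).exp()
        return pi, mu, sigma

    def sample(self, feat):
        pi, mu, sigma = self.forward(feat)
        mode = torch.multinomial(pi, 1).squeeze(-1)      # pick a mixture mode
        idx = mode[:, None, None].expand(-1, 1, self.box_dim)
        m = mu.gather(1, idx).squeeze(1)
        s = sigma.gather(1, idx).squeeze(1)
        return m + s * torch.randn_like(m)               # one diverse box sample

feat = torch.randn(8, 128)   # pooled feature of an observed pose trajectory
boxes = MixtureBoxHead().sample(feat)
```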
ME-PCN: Point Completion Conditioned on Mask Emptiness
Point completion refers to completing the missing geometries of an object
from incomplete observations. Mainstream methods predict the missing shapes by
decoding a global feature learned from the input point cloud, which often leads
to deficient results in preserving topology consistency and surface details. In
this work, we present ME-PCN, a point completion network that leverages
'emptiness' in 3D shape space. Given a single depth scan, previous methods
often encode the occupied partial shapes while ignoring the empty regions (e.g.
holes) in depth maps. In contrast, we argue that these 'emptiness' clues
indicate shape boundaries that can be used to improve topology representation
and detail granularity on surfaces. Specifically, our ME-PCN encodes both the
occupied point cloud and the neighboring 'empty points'. It estimates
coarse-grained but complete and reasonable surface points in the first stage,
followed by a refinement stage to produce fine-grained surface details.
Comprehensive experiments verify that our ME-PCN achieves better qualitative
and quantitative performance than the state of the art. Moreover, we show that
our 'emptiness' design is lightweight and easy to embed in existing methods,
where it consistently improves CD and EMD scores.
Accepted to ICCV 2021; typos corrected.
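One plausible reading of the 'emptiness' encoding is to tag every point with an occupancy flag and feed observed and empty points through a shared point-wise encoder, a minimal sketch of which follows. The fourth-channel flag, the MLP sizes, and the max-pooling are assumptions for illustration, not ME-PCN's exact design.

```python
import torch
import torch.nn as nn

class EmptinessEncoder(nn.Module):
    """PointNet-style encoder over occupied points plus 'empty' points.

    Each point carries a 4th channel: 1.0 for observed surface points,
    0.0 for empty-space points sampled from holes in the depth map.
    """
    def __init__(self, out_dim=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(4, 64), nn.ReLU(),
            nn.Linear(64, out_dim),
        )

    def forward(self, occupied, empty):       # both: (B, N, 3)
        occ = torch.cat([occupied, torch.ones_like(occupied[..., :1])], dim=-1)
        emp = torch.cat([empty, torch.zeros_like(empty[..., :1])], dim=-1)
        pts = torch.cat([occ, emp], dim=1)    # (B, N_occ + N_emp, 4)
        return self.mlp(pts).max(dim=1).values  # global feature, max-pooled

occupied = torch.rand(2, 1024, 3)             # partial scan points
empty = torch.rand(2, 256, 3)                 # points cast through depth holes
feat = EmptinessEncoder()(occupied, empty)    # conditions the completion decoder
```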
NerVE: Neural Volumetric Edges for Parametric Curve Extraction from Point Cloud
Extracting parametric edge curves from point clouds is a fundamental problem
in 3D vision and geometry processing. Existing approaches mainly rely on
keypoint detection, a challenging procedure that tends to generate noisy
output, making the subsequent edge extraction error-prone. To address this
issue, we propose to directly detect structured edges to circumvent the
limitations of the previous point-wise methods. We achieve this goal by
presenting NerVE, a novel neural volumetric edge representation that can be
easily learned through a volumetric learning framework. NerVE can be seamlessly
converted to a versatile piece-wise linear (PWL) curve representation, enabling
a unified strategy for learning all types of free-form curves. Furthermore, as
NerVE encodes rich structural information, we show that edge extraction based
on NerVE can be reduced to a simple graph search problem. After converting
NerVE to the PWL representation, parametric curves can be obtained via
off-the-shelf spline fitting algorithms. We evaluate our method on the
challenging ABC dataset. We show that a simple network based on NerVE can
already outperform the previous state-of-the-art methods by a large margin.
Project page: https://dongdu3.github.io/projects/2023/NerVE/.
Accepted by CVPR 2023.
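To make the "edge extraction as graph search" idea concrete, the sketch below treats a boolean volumetric edge grid as a graph (occupied cells as nodes, neighbouring cells as edges) and reads polylines off its connected components. Cell centers stand in for NerVE's learned per-cell points, and the 26-neighbour adjacency is an assumption, not the paper's learned connectivity.

```python
import numpy as np
import networkx as nx

def edge_grid_to_pwl(edge_grid):
    """Turn a boolean volumetric edge grid into piece-wise linear curves."""
    G = nx.Graph()
    cells = set(map(tuple, np.argwhere(edge_grid)))
    for c in cells:
        G.add_node(c)
        for dx in (-1, 0, 1):                # connect 26-neighbour edge cells
            for dy in (-1, 0, 1):
                for dz in (-1, 0, 1):
                    n = (c[0] + dx, c[1] + dy, c[2] + dz)
                    if n != c and n in cells:
                        G.add_edge(c, n)
    curves = []
    for comp in nx.connected_components(G):
        order = list(nx.dfs_preorder_nodes(G.subgraph(comp)))  # simple traversal
        curves.append(np.array(order, dtype=float) + 0.5)      # cell centers
    return curves   # each is an (M, 3) polyline, ready for spline fitting

grid = np.zeros((16, 16, 16), dtype=bool)
grid[2:10, 3, 3] = True                       # a toy straight edge
polylines = edge_grid_to_pwl(grid)
```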
Surgical Instruction Generation with Transformers
Automatic surgical instruction generation is a prerequisite for intra-operative context-aware surgical assistance. However, generating instructions from surgical scenes is challenging, as it requires jointly understanding the surgical activity in the current view and modelling the relationships between visual information and textual description. Inspired by neural machine translation and image captioning in the open domain, we introduce a transformer-backboned encoder-decoder network with self-critical reinforcement learning to generate instructions from surgical images. We evaluate the effectiveness of our method on the DAISI dataset, which includes 290 procedures from various medical disciplines. Our approach outperforms the existing baseline on all caption evaluation metrics. The results demonstrate the benefit of the transformer-backboned encoder-decoder structure in handling multimodal context.
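The self-critical objective can be summarized in a few lines: the reward of a sampled instruction, baselined by the reward of the greedy decode, weights the sampled sequence's log-probability. The sketch below assumes rewards come from a caption metric such as CIDEr; it is a generic SCST loss, not this paper's exact training code.

```python
import torch

def self_critical_loss(logprobs_sampled, reward_sampled, reward_greedy):
    """Self-critical sequence training (SCST) loss.

    logprobs_sampled: (B,) sum of token log-probs of a sampled instruction
    reward_sampled:   (B,) e.g. CIDEr score of the sampled instruction
    reward_greedy:    (B,) score of the greedy-decoded instruction (baseline)
    """
    advantage = reward_sampled - reward_greedy        # baseline-subtracted reward
    return -(advantage.detach() * logprobs_sampled).mean()

# toy usage: in practice the rewards come from a caption metric such as CIDEr
logp = torch.randn(4, requires_grad=True)
loss = self_critical_loss(logp, torch.rand(4), torch.rand(4))
loss.backward()
```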
Data-driven train set crash dynamics simulation
Traditional finite element (FE) methods are expensive for simulating train crashes. Their high computational cost limits direct application to the dynamic behaviour of an entire train set in crashworthiness design and structural optimisation. Multi-body modelling, by contrast, is widely used because of its low computational cost, with a trade-off in accuracy. In this study, a data-driven train crash modelling method is proposed to improve the performance of multi-body dynamics simulation of a train set crash without increasing the computational burden. This is achieved with a parallel random forest algorithm, a machine learning approach that extracts useful patterns from force–displacement curves and predicts the force–displacement relation for a given collision condition from a collection of offline FE simulation data covering various collision conditions, namely different crash velocities in our analysis. Using the FE simulation results as a benchmark, we compared our method with traditional multi-body modelling methods; the results show that our data-driven method improves accuracy over traditional multi-body models in train crash simulation while running at the same level of efficiency.
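A minimal sketch of the data-driven idea, with synthetic numbers standing in for the offline FE data: fit a random forest on (displacement, crash velocity) → force samples, then let the multi-body solver query it at each time step. The toy force law and hyperparameters are assumptions for illustration only.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Offline: FE simulations at several crash velocities provide
# (displacement, velocity) -> force samples. Synthetic data stands in here.
rng = np.random.default_rng(0)
disp = rng.uniform(0.0, 0.5, 5000)           # displacement, m
vel = rng.choice([10.0, 15.0, 20.0], 5000)   # simulated crash velocities, m/s
force = 1e6 * disp * (1 + 0.05 * vel) + rng.normal(0, 1e4, 5000)  # toy law, N

model = RandomForestRegressor(n_estimators=200, n_jobs=-1)  # parallel forest
model.fit(np.column_stack([disp, vel]), force)

# Online: the multi-body solver queries the learned force-displacement
# relation each time step instead of evaluating a full FE model.
f = model.predict([[0.12, 15.0]])[0]
```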
Shallow2Deep: Indoor scene modeling by single image understanding
Dense indoor scene modeling from 2D images has been bottlenecked by the absence of depth information and cluttered occlusions. We present an automatic indoor scene modeling approach using deep features from neural networks. Given a single RGB image, our method simultaneously recovers semantic contents, 3D geometry and object relationships by reasoning about the indoor environment context. In particular, we design a shallow-to-deep architecture based on convolutional networks for semantic scene understanding and modeling. It involves multi-level convolutional networks that parse indoor semantics/geometry into non-relational and relational knowledge. Non-relational knowledge extracted by shallow-end networks (e.g. room layout, object geometry) is fed forward into deeper levels to parse relational semantics (e.g. support relationships), and a Relation Network is proposed to infer the support relationships between objects. All the structured semantics and geometry above are assembled to guide a global optimization for 3D scene modeling. Qualitative and quantitative analysis demonstrates the feasibility of our method in understanding and modeling semantics-enriched indoor scenes, evaluated in terms of reconstruction accuracy, computational performance and scene complexity.
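The Relation Network component can be sketched as a pairwise scorer: concatenate the features of every ordered object pair and classify the support relation between them. The relation set, feature size, and MLP below are illustrative assumptions, not the paper's exact network.

```python
import torch
import torch.nn as nn

class RelationNet(nn.Module):
    """Score support relations between every ordered pair of detected objects."""
    def __init__(self, feat_dim=128, num_relations=3):
        super().__init__()
        # relations might be: supported-from-below, supported-from-behind, none
        self.g = nn.Sequential(
            nn.Linear(2 * feat_dim, 128), nn.ReLU(),
            nn.Linear(128, num_relations),
        )

    def forward(self, obj_feats):             # (N, feat_dim), one row per object
        N = obj_feats.shape[0]
        a = obj_feats[:, None, :].expand(N, N, -1)
        b = obj_feats[None, :, :].expand(N, N, -1)
        pairs = torch.cat([a, b], dim=-1)     # all ordered pairs (i supports j?)
        return self.g(pairs)                  # (N, N, num_relations) logits

feats = torch.randn(5, 128)                   # e.g. pooled per-object CNN features
support_logits = RelationNet()(feats)
```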
Semantic modeling of indoor scenes with support inference from a single photograph
We present an automatic approach to the semantic modeling of indoor scenes from a single photograph, instead of relying on depth sensors. Without using handcrafted features, we guide indoor scene modeling with feature maps extracted by fully convolutional networks. Three parallel fully convolutional networks are adopted to generate object instance masks, a depth map, and an edge map of the room layout. Based on these high-level features, support relationships between indoor objects can be efficiently inferred in a data-driven manner. Constrained by the support context, a global-to-local model matching strategy is followed to retrieve the whole indoor scene. We demonstrate that the proposed method can efficiently retrieve indoor objects, including in situations where the objects are badly occluded. This approach enables efficient semantics-based scene editing.
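As one example of how such support inference can be grounded in the three feature maps, the heuristic below checks whether the pixels just below one object's mask fall on another object at a consistent depth. This is only an illustrative cue of the kind a data-driven classifier could consume, not the paper's inference procedure.

```python
import numpy as np

def support_candidate(mask_i, mask_j, depth, depth_tol=0.1):
    """Heuristic cue: does object i plausibly support object j from below?

    mask_i, mask_j: (H, W) boolean instance masks; depth: (H, W) depth map.
    Checks whether the pixels just below object j's lower boundary fall on
    object i at a consistent depth.
    """
    ys, xs = np.nonzero(mask_j)
    if len(ys) == 0:
        return False
    bottom = ys.max()
    cols = xs[ys == bottom]                   # j's lowest pixel columns
    below = np.clip(bottom + 1, 0, depth.shape[0] - 1)
    touching = mask_i[below, cols]            # do those pixels land on i?
    if not touching.any():
        return False
    d_gap = np.abs(depth[below, cols] - depth[bottom, cols])
    return bool((d_gap[touching] < depth_tol).any())
```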
PatchComplete: Learning Multi-Resolution Patch Priors for 3D Shape Completion on Unseen Categories
While 3D shape representations enable powerful reasoning in many visual and
perception applications, learning 3D shape priors tends to be constrained to
the specific categories trained on, leading to an inefficient learning process,
particularly for general applications with unseen categories. Thus, we propose
PatchComplete, which learns effective shape priors based on multi-resolution
local patches, which are often more general than full shapes (e.g., chairs and
tables often both share legs) and thus enable geometric reasoning about unseen
class categories. To learn these shared substructures, we learn
multi-resolution patch priors across all training categories, which are then
associated with input partial shape observations by attention across the patch
priors, and finally decoded into a complete shape reconstruction. Such
patch-based priors avoid overfitting to specific training categories and enable
reconstruction on entirely unseen categories at test time. We demonstrate the
effectiveness of our approach on synthetic ShapeNet data as well as challenging
real-scanned objects from ScanNet, which include noise and clutter, improving
over state of the art in novel-category shape completion by 19.3% in chamfer
distance on ShapeNet, and 9.0% on ScanNet.
Video: https://www.youtube.com/watch?v=Ch1rvw2D_Kc Project page: https://yuchenrao.github.io/projects/patchComplete/patchComplete.htm
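The association step can be pictured as cross-attention from partial-shape patch features to a learned bank of patch priors, as in the sketch below; a decoder would then turn the attended features into a complete shape. The prior-bank parameterization and attention sizes are assumptions for exposition, not PatchComplete's exact architecture.

```python
import torch
import torch.nn as nn

class PatchPriorAttention(nn.Module):
    """Associate partial-shape features with a learned bank of patch priors."""
    def __init__(self, feat_dim=128, num_priors=64):
        super().__init__()
        # learned patch priors, shared across all training categories
        self.priors = nn.Parameter(torch.randn(num_priors, feat_dim))
        self.attn = nn.MultiheadAttention(feat_dim, num_heads=4, batch_first=True)

    def forward(self, partial_feats):         # (B, P, feat_dim), P local patches
        bank = self.priors.unsqueeze(0).expand(partial_feats.shape[0], -1, -1)
        # each observed patch attends to the prior bank; the attended mixture
        # of priors is what a decoder turns into a complete reconstruction
        out, _ = self.attn(query=partial_feats, key=bank, value=bank)
        return out

feats = torch.randn(2, 32, 128)               # features of observed local patches
completed_feats = PatchPriorAttention()(feats)
```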